Group3.2 Manuel-Ismail – STRAP, Dendroscope and own Java Tool
University of Konstanz

VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease

Authors and Affiliations:

Manuel Müller, University of Konstanz, manuel.mueller@uni-konstanz.de

Ismail Yildiz, University of Konstanz, ismail.yildiz@uni-konstanz.de

Peter Bak, University of Konstanz, peter.bak@uni-konstanz.de

 

Tool(s):

STRAP – Structure based Sequence Alignment Program.

STRAP is a comfortable and comprehensive tool to edit multiple protein sequence alignments. A wide range of functions related to protein sequences and protein structures are accessible with an intuitive graphical interface.

Author: Christoph Gille, Institute for Biochemistry, Charité Humboldt-University Berlin
Website:
http://www.bioinformatics.org/strap/

 

Dendroscope

Dendroscope is a tool for visualizing phylogenetic trees and rooted networks.

Author: Daniel H. Huson
Website: http://www-ab.informatik.uni-tuebingen.de/software/dendroscope

 

Self-made Java Tool

Our tool includes several functions, such as calculating the distance between sequences (the number of substitutions from one sequence to another) and building a phylogenetic tree data structure.

Instructions: When asked for a folder, please select the folder containing the dataset.
Author: Manuel Müller, Ismail Yildiz
Link: jar

 

Video:

 

Click here

 

 

ANSWERS:


MC3.1: What is the region or country of origin for the current outbreak?  Please provide your answer as the name of the native viral strain along with a brief explanation.

 

To find the origin of the current outbreak, we used STRAP, which calculates pairwise distances and visualizes them in a spring-embedded graph.

 

springgraph1.png

Figure 1.1: Spring embedded dissimilarity graph.

 

In Figure 1.1, the edge lengths indicate the dissimilarity of sequences. Nigeria_B is the native sequence with the shortest edges to all the current outbreak sequences, which means that Nigeria_B needs the fewest substitutions to become one of the current outbreak sequences. Nigeria_B is therefore the most likely origin of the current outbreak.

 

To corroborate this result, we calculated the average distance from every native sequence to all current outbreak sequences and visualized the averages in a bar chart (Figure 1.2).

 

AverageDistanceBarChart-Natvie.png

Figure 1.2 Average distance of Native Sequences to the Current Outbreak.
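
The average-distance check can be sketched in a few lines of Java. This is a minimal illustration, not the code of our tool: it assumes that all sequences are already aligned and of equal length, so that the distance is simply the number of differing positions. The class and method names (AverageDistance, distance, averageDistance, closestNative) are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, assuming aligned sequences of equal length.
final class AverageDistance {

    /** Number of positions at which the two sequences differ (substitutions only). */
    static int distance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                d++;
            }
        }
        return d;
    }

    /** Mean distance from one native sequence to every outbreak sequence. */
    static double averageDistance(String nativeSeq, List<String> outbreak) {
        if (outbreak.isEmpty()) {
            throw new IllegalArgumentException("no outbreak sequences given");
        }
        long sum = 0;
        for (String s : outbreak) {
            sum += distance(nativeSeq, s);
        }
        return (double) sum / outbreak.size();
    }

    /** Name of the native sequence with the smallest mean distance to the outbreak. */
    static String closestNative(Map<String, String> natives, List<String> outbreak) {
        String best = null;
        double bestAvg = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, String> e : natives.entrySet()) {
            double avg = averageDistance(e.getValue(), outbreak);
            if (avg < bestAvg) {
                bestAvg = avg;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Calling closestNative with a map from native strain names to sequences and the list of outbreak sequences returns the name of the strain with the shortest bar in a chart like Figure 1.2.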


MC3.2:  Over time, the virus spreads and the diversity of the virus increases as it mutates.  Two patients infected with the Drafa virus are in the same hospital as Nicolai.  Nicolai has a strain identified by sequence 583.  One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51.  Assume only a single viral strain is in each patient.  Which patient likely contracted the illness from Nicolai and why?  Please provide your answer as the sequence number along with a brief explanation.

 

To identify the person who likely contracted the illness from Nicolai, we used the tool STRAP to generate a spring embedded graph for the three sequences. The edge lengths in the graph indicate the dissimilarity (number of substitutions) between the sequences.

 

springgraph2.png

Figure 2.1 Spring embedded graph

 

This visualization shows that sequence 123 is more similar to sequence 583 than sequence 51 is.

Further, we used our own tool to identify the involved substitutions and got the following results:

583 → 123      (A->C@269)

583 → 51       (A->C@494, C->T@842, T->A@946)

 

From this result we conclude that the patient with sequence 123 is more likely to have contracted the illness from Nicolai than the patient with sequence 51, because sequence 123 is closer to 583 (distance = 1) than sequence 51 is (distance = 3).
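
A substitution list of this kind can be produced by comparing two sequences position by position. The following Java sketch is an illustration only and not the code of our tool. It assumes aligned sequences of equal length and 1-based positions counted from left to right; the class name SubstitutionList is hypothetical, and the output format mirrors the notation used above (for example A->C@269).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, assuming aligned sequences of equal length.
final class SubstitutionList {

    /** Lists every substitution needed to turn 'from' into 'to', e.g. "A->C@269". */
    static List<String> substitutions(String from, String to) {
        if (from.length() != to.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < from.length(); i++) {
            char a = from.charAt(i);
            char b = to.charAt(i);
            if (a != b) {
                // Positions are reported 1-based, counted from the left.
                result.add(a + "->" + b + "@" + (i + 1));
            }
        }
        return result;
    }
}
```

The size of the returned list is the distance between the two sequences.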

 

To check our result, we looked at how the sequences evolved.

To visualize this evolution, we used a phylogenetic tree.

 

nicolai.PNG

Figure 2.2 Phylogenetic tree (edge length represents the number of substitutions)

In Figure 2.2 we can see that sequence 123 follows the same path as Nicolai's sequence 583 plus one additional substitution, while sequence 51 branches off along a different path.


MC3.3:  Signs and symptoms of the Drafa virus are varied and humans react differently to infection.  Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them. 

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic).  The mutations involve one or more base substitutions.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39)

 

To identify the top 3 mutations that lead to an increase in symptom severity, we first assigned each sequence a numeric value for its symptom severity (mild=0, moderate=1, severe=2). We then used our Java tool to build a phylogenetic tree data structure from the sequences. With this tree we were able to determine, for each substitution (which is represented as an edge), which sequences include it and which do not.
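
The idea of reading the including sequences off the tree can be sketched as follows. This is a simplified illustration under stated assumptions, not our actual data structure: every node is assumed to hold one sequence, every edge to carry exactly one substitution label, and a sequence counts as including a substitution if that edge lies on its path from the root. The class name SeqNode and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node: one sequence per node, one substitution label per edge.
final class SeqNode {
    final String sequenceName;
    final String incomingSubstitution; // label of the edge from the parent, null for the root
    final List<SeqNode> children = new ArrayList<>();

    SeqNode(String sequenceName, String incomingSubstitution) {
        this.sequenceName = sequenceName;
        this.incomingSubstitution = incomingSubstitution;
    }

    /** Adds a child reached via the given substitution and returns it. */
    SeqNode addChild(String name, String substitution) {
        SeqNode child = new SeqNode(name, substitution);
        children.add(child);
        return child;
    }

    /** Names of all sequences at or below an edge labelled with the substitution. */
    static List<String> sequencesIncluding(SeqNode root, String substitution) {
        List<String> out = new ArrayList<>();
        collect(root, substitution, false, out);
        return out;
    }

    private static void collect(SeqNode node, String substitution,
                                boolean inherited, List<String> out) {
        // A node includes the substitution if an ancestor edge or its own edge carries it.
        boolean has = inherited || substitution.equals(node.incomingSubstitution);
        if (has) {
            out.add(node.sequenceName);
        }
        for (SeqNode c : node.children) {
            collect(c, substitution, has, out);
        }
    }
}
```

All sequences not returned by sequencesIncluding form the excluding group for that substitution.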


Then we calculated for every substitution a weighted average increase value using the following formula:

Including_symp_sum: sum of symptom values of sequences including the substitution.

To get a visual overview of these results, we generated a phylogenetic tree with STRAP, then exported it and redesigned it in Dendroscope (Figure 3.1). Nodes represent sequences, edges represent substitutions, color is mapped to symptom severity, and every substitution has its weighted average increase value in square brackets.

 

 

phylotree1.png

Figure 3.1 Phylogenetic Tree Visualization of the current outbreak in Dendroscope.

 

In Figure 3.1 we marked in red the top 3 substitutions that lead to an increase in symptom severity:

 

A → T, 946 (A changed to T at position 946)

T → C, 842 (T changed to C at position 842)

G → A, 223 (G changed to A at position 223)

 

weighted-avg_inc3.3.png

Figure 3.2 Bar chart of weighted average increase for all substitutions.


MC3.4:  Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question.  To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important.  Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions.  In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39).

 

To identify the top 3 mutations that lead to the most dangerous viral strains, we first transformed the ordinal values from the characteristic table into numeric values (mild=0, moderate=1, severe=2). We then used our Java tool to build a phylogenetic tree data structure from the sequences. With this tree we were able to determine, for each substitution (which is represented as an edge), which sequences include it and which do not.

 

Then, for each substitution and each characteristic, we calculated one characteristic sum over all sequences that include the substitution and one over all sequences that exclude it. The phylogenetic tree structure makes it easy to distinguish the including from the excluding sequences. We then turned each sum into an average by dividing it by the number of sequences involved. The difference between these two averages indicates either an increase (positive value) or a decrease (negative value) in the severity of that characteristic: a positive difference means that the subtree of sequences including the substitution has a higher average characteristic value than the subtree of sequences not including it.

Finally, we weighted the average increase by the number of occurrences of the substitution. These values let us distinguish substitutions that cause an increase in a characteristic from those that cause a decrease. Summing the weighted average increase values over all characteristics yields a single value that indicates the overall increase of the characteristics.
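
The calculation described above can be sketched in Java as follows. This is an illustration under assumptions, not the code of our tool: "number of occurrences" is interpreted here as the number of sequences that include the substitution, and the excluding group is taken to be all remaining sequences. The class name WeightedIncrease, the method overallIncrease and the layout of the scores array are hypothetical.

```java
import java.util.Set;

// Hypothetical helper; the weighting interpretation is an assumption (see lead-in).
final class WeightedIncrease {

    /**
     * scores[i][c] is the numeric value of characteristic c for sequence i
     * (e.g. mild=0, moderate=1, severe=2). 'including' holds the indices of
     * the sequences that include the substitution.
     */
    static double overallIncrease(int[][] scores, Set<Integer> including) {
        int n = scores.length;
        int nIncl = including.size();
        int nExcl = n - nIncl;
        if (nIncl == 0 || nExcl == 0) {
            return 0.0; // one group is empty, so there is nothing to compare
        }
        int numCharacteristics = scores[0].length;
        double total = 0.0;
        for (int c = 0; c < numCharacteristics; c++) {
            double sumIncl = 0.0;
            double sumExcl = 0.0;
            for (int i = 0; i < n; i++) {
                if (including.contains(i)) {
                    sumIncl += scores[i][c];
                } else {
                    sumExcl += scores[i][c];
                }
            }
            // Difference of the two group averages: positive means an increase.
            double averageIncrease = sumIncl / nIncl - sumExcl / nExcl;
            // Weight by the number of including sequences, then sum over characteristics.
            total += averageIncrease * nIncl;
        }
        return total;
    }
}
```

With a single column in scores (symptom severity only), the same method gives a per-substitution value of the kind used for MC3.3; with one column per characteristic it gives the overall value used here.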

To get a visual overview of these results, we imported the current outbreak sequences into STRAP and generated a phylogenetic tree (Figure 4.1).

 

strap.png

Figure 4.1 Phylogenetic Tree Visualization in STRAP

 

Because STRAP offers few customization settings, we decided to export the tree and import it into Dendroscope, which offers far more customization.

 

phylotree2.png

Figure 4.2 Phylogenetic Tree Visualization of the current outbreak in Dendroscope.

 

In Figure 4.2 the nodes represent sequences, the color of the nodes is mapped to the characteristic sum (the sum of the numeric values over all characteristics), and the edges represent substitutions. Every substitution has its weighted average overall increase value in square brackets.

 

With this visualization we can easily find the worst substitutions, that is, those that lead to an overall increase in the characteristics, by picking the ones with the highest weighted average overall increase value (the top three are marked in red).

A → T, 946 (A changed to T at position 946)

T → C, 842 (T changed to C at position 842)

T → C, 790 (T changed to C at position 790)

A complete list of the substitutions and their overall characteristic increase values is provided in Figure 4.3.

 

barchart34.png

Figure 4.3 Bar chart of weighted average overall increase for all substitutions.

The estimated time for processing the question was about 30-40 hours.